Skip to content

bugprone-narrowing-conversions#9505

Open
Jacobfaib wants to merge 1 commit into
NVIDIA:mainfrom
Jacobfaib:jacobf/2026-06-17/bugprone-narrowing-conversions
Open

bugprone-narrowing-conversions#9505
Jacobfaib wants to merge 1 commit into
NVIDIA:mainfrom
Jacobfaib:jacobf/2026-06-17/bugprone-narrowing-conversions

Conversation

@Jacobfaib

Copy link
Copy Markdown
Contributor

Description

closes

Checklist

  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@Jacobfaib Jacobfaib self-assigned this Jun 17, 2026
@Jacobfaib Jacobfaib requested review from a team as code owners June 17, 2026 17:36
@github-project-automation github-project-automation Bot moved this to Todo in CCCL Jun 17, 2026
@cccl-authenticator-app cccl-authenticator-app Bot moved this from Todo to In Review in CCCL Jun 17, 2026
@coderabbitai

This comment was marked as off-topic.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (4)
cub/cub/detail/warpspeed/allocators/smem_allocator.cuh (1)

53-60: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

important: Handle align == 0 before round-up math in alloc.
With the current formula on Line 56, align defaulting to 0 makes (align - 1) underflow and can force ptrAllocation32 to 0, which breaks allocation monotonicity. The cast on Line 59 then hides this by folding a wrapped unsigned delta into int. Guard zero alignment (e.g., normalize to 1 or assert/reject) before computing ptrAllocation32.

libcudacxx/benchmarks/bench/find_if/basic.cu (1)

27-33: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

suggestion: mismatch_point can be elements when MismatchAt is 1.0, but state.add_global_memory_reads<T>(mismatch_point + 1) then records elements + 1 reads. Cap the reported read count to elements to avoid skewed benchmark metadata for that axis value.

thrust/benchmarks/bench/none_of/basic.cu (1)

21-27: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

suggestion: for MismatchAt == 1.0, mismatch_point becomes elements, so state.add_global_memory_reads<T>(mismatch_point + 1) overreports by one. Please clamp the accounting value to elements to keep benchmark stats accurate.

cub/test/catch2_test_device_reduce_env_api.cu (1)

418-446: ⚠️ Potential issue | 🟠 Major

important: Line 443 validates device results after launching work on stream_ref, but this block does not synchronize that stream first. Unlike the other stream tests in this file, this can race outstanding work and make the assertion nondeterministic. Add stream.sync(); before the REQUIRE checks.

🧹 Nitpick comments (11)
thrust/thrust/detail/complex/cexpf.h (1)

61-61: ⚡ Quick win

important: Line 61 still evaluates the exponent expression in uint32_t before casting, so underflow can occur in unsigned space and then convert to int with implementation-defined semantics. Cast to signed before arithmetic (e.g., const int exp_bits = static_cast<int>(hx >> 23); *expt = exp_bits - (0x7f + 127) + static_cast<int>(k);).

cudax/benchmarks/bench/copy/copy_bench.cu (1)

25-25: important: this benchmark .cu change should include before/after performance evidence (and SASS-diff check if codegen changed), otherwise we can’t confirm it is regression-free.

As per coding guidelines, “Do not commit SASS code changes without running benchmarks to check for performance regressions,” and benchmark changes should follow docs/cub/benchmarking.rst.

Source: Coding guidelines

libcudacxx/benchmarks/bench/all_of/basic.cu (1)

27-27: important: please attach benchmark comparison output for this arithmetic/codegen-touching benchmark change (base vs new), so we can confirm no performance regression.

As per coding guidelines, benchmark updates should follow docs/cub/benchmarking.rst, and .cu SASS-affecting changes require benchmark validation.

Source: Coding guidelines

libcudacxx/benchmarks/bench/rotate_copy/basic.cu (1)

24-24: important: this benchmark arithmetic change should include base/new benchmark results to validate no throughput regression in rotate_copy.

As per coding guidelines, benchmark changes should follow docs/cub/benchmarking.rst, and .cu changes that can alter generated code should be benchmark-validated.

Source: Coding guidelines

thrust/benchmarks/bench/any_of/basic.cu (1)

21-21: important: please provide before/after any_of benchmark output for this change to confirm no regression in measured performance.

As per coding guidelines, benchmark updates should follow docs/cub/benchmarking.rst, and .cu changes with potential codegen impact should be validated with benchmarks.

Source: Coding guidelines

c/experimental/stf/test/test_cuda_kernel.cu (1)

20-21: ⚡ Quick win

suggestion: declare tid and nthreads as const int since they are not reassigned; this matches the repository rule that non-mutating locals should be const.

As per coding guidelines: “All non-const variables must use const.”

Source: Coding guidelines

c/experimental/stf/test/test_host_launch.cu (1)

20-21: ⚡ Quick win

suggestion: make tid and nthreads const int; both are immutable after initialization and should follow the project’s const-local requirement.

As per coding guidelines: “All non-const variables must use const.”

Source: Coding guidelines

c/experimental/stf/test/test_logical_data_with_place.cu (1)

26-26: ⚡ Quick win

suggestion: change int i to const int i here; the value is not mutated and the repo style requires const locals when possible.

As per coding guidelines: “All non-const variables must use const.”

Source: Coding guidelines

thrust/examples/sum_columns.cu (1)

43-43: ⚡ Quick win

suggestion: add <cstddef> for std::ptrdiff_t and make this local const auto since it is not reassigned. As per coding guidelines, files must include headers for symbols they use (no transitive reliance), and non-const variables should be const.

Source: Coding guidelines

thrust/examples/sum_rows.cu (1)

43-43: ⚡ Quick win

suggestion: include <cstddef> for std::ptrdiff_t and change row_idx_end to const auto because it is never modified. As per coding guidelines, do not rely on transitive includes and use const for non-mutated variables.

Source: Coding guidelines

c/parallel/test/test_segmented_reduce.cpp (1)

416-417: ⚡ Quick win

suggestion: add <cstddef> because these lines use std::ptrdiff_t, and make both iterator locals const auto since they are read-only. As per coding guidelines, files must include headers for used symbols and non-mutated locals should be const.

Also applies to: 487-488

Source: Coding guidelines


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: db866020-3a6c-4d48-a5b0-c6db8daef821

📥 Commits

Reviewing files that changed from the base of the PR and between a4ffbcf and bfc3ca8.

📒 Files selected for processing (167)
  • .clang-tidy
  • c/experimental/stf/test/test_async_resources_handle.cu
  • c/experimental/stf/test/test_cuda_kernel.cu
  • c/experimental/stf/test/test_host_launch.cu
  • c/experimental/stf/test/test_logical_data_with_place.cu
  • c/experimental/stf/test/test_stackable.cu
  • c/experimental/stf/test/test_stream_ctx_override.cu
  • c/experimental/stf/test/test_task_get_graph.cu
  • c/parallel/test/test_histogram.cpp
  • c/parallel/test/test_segmented_reduce.cpp
  • c/parallel/test/test_segmented_sort.cpp
  • c/parallel/test/test_util.h
  • c2h/include/c2h/cpu_timer.h
  • cub/benchmarks/bench/merge/merge_common.cuh
  • cub/cub/agent/agent_batch_memcpy.cuh
  • cub/cub/agent/agent_histogram.cuh
  • cub/cub/agent/agent_radix_sort_histogram.cuh
  • cub/cub/agent/agent_radix_sort_onesweep.cuh
  • cub/cub/agent/agent_radix_sort_upsweep.cuh
  • cub/cub/agent/agent_reduce_by_key.cuh
  • cub/cub/agent/agent_rle.cuh
  • cub/cub/agent/agent_scan.cuh
  • cub/cub/agent/agent_scan_by_key.cuh
  • cub/cub/agent/agent_select_if.cuh
  • cub/cub/agent/agent_three_way_partition.cuh
  • cub/cub/agent/agent_topk.cuh
  • cub/cub/agent/agent_unique_by_key.cuh
  • cub/cub/agent/single_pass_scan_operators.cuh
  • cub/cub/block/block_radix_rank.cuh
  • cub/cub/block/specializations/block_reduce_warp_reductions.cuh
  • cub/cub/detail/warpspeed/allocators/smem_allocator.cuh
  • cub/cub/detail/warpspeed/look_ahead.cuh
  • cub/cub/detail/warpspeed/squad/squad.cuh
  • cub/cub/detail/warpspeed/sync_handler.cuh
  • cub/cub/device/dispatch/dispatch_scan_by_key.cuh
  • cub/cub/device/dispatch/kernels/kernel_histogram.cuh
  • cub/cub/device/dispatch/kernels/kernel_reduce_deterministic.cuh
  • cub/cub/device/dispatch/kernels/kernel_scan_lookahead.cuh
  • cub/cub/device/dispatch/kernels/kernel_segmented_reduce.cuh
  • cub/cub/util_ptx.cuh
  • cub/cub/warp/specializations/warp_scan_shfl.cuh
  • cub/examples/block/example_block_radix_sort.cu
  • cub/examples/block/example_block_reduce.cu
  • cub/examples/block/example_block_scan.cu
  • cub/examples/device/example_device_partition_flagged.cu
  • cub/examples/device/example_device_partition_if.cu
  • cub/examples/device/example_device_select_flagged.cu
  • cub/examples/device/example_device_select_if.cu
  • cub/examples/device/example_device_select_unique.cu
  • cub/examples/device/example_device_sort_find_non_trivial_runs.cu
  • cub/examples/device/example_device_topk_pairs.cu
  • cub/test/catch2_large_array_sort_helper.cuh
  • cub/test/catch2_large_problem_helper.cuh
  • cub/test/catch2_segmented_sort_helper.cuh
  • cub/test/catch2_test_block_load_to_shared.cu
  • cub/test/catch2_test_block_scan.cu
  • cub/test/catch2_test_block_shuffle.cu
  • cub/test/catch2_test_device_reduce_env_api.cu
  • cub/test/catch2_test_device_scan.cuh
  • cub/test/catch2_test_device_segmented_radix_sort_pairs.cu
  • cub/test/catch2_test_device_segmented_scan_api.cu
  • cub/test/catch2_test_device_segmented_scan_noncommutative.cu
  • cub/test/catch2_test_device_select_env_api.cu
  • cub/test/catch2_test_device_topk_common.cuh
  • cub/test/catch2_test_device_transform.cu
  • cub/test/catch2_test_device_transform_api.cu
  • cub/test/catch2_test_launch_wrapper.cu
  • cub/test/catch2_test_memcpy_vectorized_copy.cu
  • cub/test/catch2_test_radix_operations.cu
  • cub/test/catch2_test_warp_exchange.cuh
  • cub/test/catch2_test_warp_load.cu
  • cub/test/catch2_test_warp_merge_sort.cu
  • cub/test/catch2_test_warp_scan.cu
  • cub/test/catch2_test_warp_scan_api.cu
  • cub/test/catch2_test_warp_scan_partial.cu
  • cub/test/catch2_test_warp_scan_partial_helper.cuh
  • cub/test/catch2_test_warp_scan_partial_invalid.cu
  • cub/test/catch2_test_warp_store.cu
  • cub/test/test_allocator.cu
  • cub/test/test_util.h
  • cub/test/warp/catch2_test_warp_reduce_batched.cu
  • cub/test/warp/catch2_test_warp_segmented_reduce.cu
  • cudax/benchmarks/bench/copy/copy_bench.cu
  • cudax/examples/simple_p2p.cu
  • cudax/examples/stdexec_stream.cu
  • cudax/examples/vector_add.cu
  • cudax/test/copy/copy.cu
  • cudax/test/copy_bytes/mdspan_d2h_h2d.cu
  • cudax/test/copy_bytes/mdspan_d2h_h2d_relaxed.cu
  • cudax/test/cuco/hyperloglog/test_hyperloglog.cu
  • libcudacxx/benchmarks/bench/adjacent_find/basic.cu
  • libcudacxx/benchmarks/bench/all_of/basic.cu
  • libcudacxx/benchmarks/bench/any_of/basic.cu
  • libcudacxx/benchmarks/bench/equal/basic.cu
  • libcudacxx/benchmarks/bench/find/basic.cu
  • libcudacxx/benchmarks/bench/find_if/basic.cu
  • libcudacxx/benchmarks/bench/find_if_not/basic.cu
  • libcudacxx/benchmarks/bench/is_heap/basic.cu
  • libcudacxx/benchmarks/bench/is_heap_until/basic.cu
  • libcudacxx/benchmarks/bench/is_partitioned/basic.cu
  • libcudacxx/benchmarks/bench/is_sorted/basic.cu
  • libcudacxx/benchmarks/bench/is_sorted_until/basic.cu
  • libcudacxx/benchmarks/bench/mismatch/basic.cu
  • libcudacxx/benchmarks/bench/none_of/basic.cu
  • libcudacxx/benchmarks/bench/rotate/basic.cu
  • libcudacxx/benchmarks/bench/rotate_copy/basic.cu
  • libcudacxx/benchmarks/bench/shift_left/basic.cu
  • libcudacxx/benchmarks/bench/shift_right/basic.cu
  • libcudacxx/include/cuda/__barrier/barrier_block_scope.h
  • libcudacxx/test/libcudacxx/cuda/ccclrt/hierarchy/hierarchy_smoke.cu
  • libcudacxx/test/libcudacxx/cuda/containers/buffer/transform.cu
  • libcudacxx/test/support/test_pstl.h
  • nvbench_helper/nvbench_helper/nvbench_helper.cuh
  • test/cuda_smoke/cuda_runtime_smoke.cu
  • thrust/benchmarks/bench/all_of/basic.cu
  • thrust/benchmarks/bench/any_of/basic.cu
  • thrust/benchmarks/bench/equal/basic.cu
  • thrust/benchmarks/bench/find/basic.cu
  • thrust/benchmarks/bench/find_if/basic.cu
  • thrust/benchmarks/bench/find_if_not/basic.cu
  • thrust/benchmarks/bench/is_partitioned/basic.cu
  • thrust/benchmarks/bench/is_sorted/basic.cu
  • thrust/benchmarks/bench/is_sorted_until/basic.cu
  • thrust/benchmarks/bench/mismatch/basic.cu
  • thrust/benchmarks/bench/none_of/basic.cu
  • thrust/examples/bucket_sort2d.cu
  • thrust/examples/cuda/wrap_pointer.cu
  • thrust/examples/discrete_voronoi.cu
  • thrust/examples/dot_products_with_zip.cu
  • thrust/examples/monte_carlo.cu
  • thrust/examples/monte_carlo_disjoint_sequences.cu
  • thrust/examples/padded_grid_reduction.cu
  • thrust/examples/sort.cu
  • thrust/examples/sum_columns.cu
  • thrust/examples/sum_rows.cu
  • thrust/examples/transform_iterator.cu
  • thrust/testing/binary_search_vector.cu
  • thrust/testing/catch2_test_adjacent_difference.cu
  • thrust/testing/catch2_test_transform.cu
  • thrust/testing/cuda/copy_if.cu
  • thrust/testing/cuda/sort.cu
  • thrust/testing/cuda/transform.cu
  • thrust/testing/dereference.cu
  • thrust/testing/for_each.cu
  • thrust/testing/gather.cu
  • thrust/testing/generate.cu
  • thrust/testing/merge.cu
  • thrust/testing/merge_by_key.cu
  • thrust/testing/partition.cu
  • thrust/testing/remove.cu
  • thrust/testing/replace.cu
  • thrust/testing/reverse.cu
  • thrust/testing/scan.cu
  • thrust/testing/shuffle.cu
  • thrust/testing/tabulate.cu
  • thrust/testing/transform_scan.cu
  • thrust/testing/unique_by_key.cu
  • thrust/testing/zip_function.cu
  • thrust/thrust/detail/complex/catrig.h
  • thrust/thrust/detail/complex/catrigf.h
  • thrust/thrust/detail/complex/cexp.h
  • thrust/thrust/detail/complex/cexpf.h
  • thrust/thrust/detail/complex/math_private.h
  • thrust/thrust/system/cuda/detail/core/util.h
  • thrust/thrust/system/cuda/detail/reduce.h
  • thrust/thrust/system/cuda/detail/reduce_by_key.h
  • thrust/thrust/system/cuda/detail/set_operations.h
💤 Files with no reviewable changes (1)
  • .clang-tidy

Comment thread c2h/include/c2h/cpu_timer.h
{
auto h_pairs = thrust::make_zip_iterator(h_res_keys.begin(), h_res_values.begin());
thrust::sort(h_pairs, h_pairs + h_res_keys.size());
thrust::sort(h_pairs, h_pairs + static_cast<std::ptrdiff_t>(h_res_keys.size()));

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

important: std::ptrdiff_t is introduced on Line 74 without adding <cstddef>. Add an explicit #include <cstddef> to keep this TU self-sufficient and avoid transitive-include dependency.

As per coding guidelines, “Files must include all headers related to the symbols that they are using and relying on transitive header inclusion is not allowed.”

Source: Coding guidelines

for (int j = 0; j < N; ++j)
{
host_data[i * N * 2 + j] = i * N + j;
host_data[static_cast<std::size_t>(i) * N * 2 + j] = static_cast<int>(static_cast<std::size_t>(i) * N + j);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

important: Line 369 hardcodes row stride as N * 2 instead of using Ld, which makes the test silently wrong if Ld is changed independently. Use static_cast<std::size_t>(i) * Ld + j to match the declared stride contract.

Comment on lines +73 to +74
for (int i = static_cast<int>(cuda::gpu_thread.rank(cuda::grid)); i < a.size();
i += static_cast<int>(cuda::gpu_thread.count(cuda::grid)))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

suggestion: use cuda::std::span<T>::size_type (or auto) for the grid-stride loop index/stride instead of int. On Line 73 and Line 74, int can overflow for large buffers while a.size() is wider, which can break termination/correct indexing in stress cases.

__global__ void increment_kernel(int* p, int n)
{
int idx = blockIdx.x * blockDim.x + threadIdx.x;
int idx = static_cast<int>(blockIdx.x * blockDim.x + threadIdx.x);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

important: Line 18 should declare idx as const because it is never reassigned.
As per coding guidelines, “All non-const variables must use const.”

Source: Coding guidelines

@github-actions

Copy link
Copy Markdown
Contributor

😬 CI Workflow Results

🟥 Finished in 5h 52m: Pass: 99%/503 | Total: 22d 17h | Max: 2h 52m | Hits: 26%/3096939

See results here.

@miscco miscco left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah this is just noise

@github-project-automation github-project-automation Bot moved this from In Review to In Progress in CCCL Jun 22, 2026
@Jacobfaib

Copy link
Copy Markdown
Contributor Author

yeah this is just noise

I disagree. This checks makes developers carefully consider the size type and does illustrate some possible real bugs. There's a lot of places where we take for granted that something fits in an int when we could/should instead use unsigned values to either be safe or to telegraph the true range of the value.

Granted the way I've silenced a lot of these is noise (by just static_cast-ing them), but this is because the real fix of changing sizing types in CUB needs careful consideration.

@fbusato

fbusato commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

I don't have a strong preference. The warning is useful. This will force developers to think about potential data loss. On the other hand, it is very verbose. The thread hierarchy API will help a lot for device code (instead of blockIdx.x * ...).
Also, is the warning overlapped with -Wnarrowing?

Comment on lines 24 to +26
const auto common_prefix = state.get_float64("MismatchAt");
const auto mismatch_point = cuda::std::clamp<std::size_t>(elements * common_prefix, 0, elements - 2);
const auto mismatch_point = cuda::std::clamp<std::size_t>(
static_cast<std::size_t>(static_cast<double>(elements) * common_prefix), std::size_t{0}, elements - 2);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should just ensure that common_prefix is size_t

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

3 participants